class: center, middle, inverse, title-slide

# Lecture 9
## Multiple Groups: Additional models
### Psych 10 C
### University of California, Irvine
### 04/15/2022

---
## Final Project

- As you already know, we will have a final project for this class instead of a final exam.

--

- This final project is worth 50% of your total grade, and it should be done in groups of 3 to 4 students.

--

- The objective is to have you write a short version of what a scientific paper should look like, including the data analysis section.

--

- To make things "easier" for you, we have selected 3 problems with their corresponding data sets that you can choose from for your final project.

--

- Answering the research question of each project requires a different approach; not all problems can be solved the same way. Therefore, before we look at the data, we want you to choose one of the three problems.

---
### Problem 1:

- We want to know if people who have had a brain surgical procedure known as split brain have problems identifying objects presented at either side of their visual field.

--

### Problem 2:

- We want to know if the cognitive decline of a person from one year to the next is associated with their level of social interaction, physical activity, and whether they used the app Luminosity during that year.

--

### Problem 3:

- We want to know how the asking price for an IKEA chair changes as a function of the total cost of the materials, whether the person built the chair themselves, and the difficulty of building it.

---
## Final project

The paper must have the following points:

- Introduction: 4 to 5 sentences that summarize the problem and why it's important. The idea is to convince your reader that what you are doing matters; you can use references to previous work (1 or 2 max).

--

- Methods: In one paragraph, explain what is in the data set and where it came from (how it was collected).
--

- Data: provide a summary of your data; you can use summary statistics like the median, mean, or variance. You can also use this section to highlight some properties of the data that you are working with by using graphs.

---
## Final project

- Model comparison: Describe the models that you will be comparing using your current data and their main assumptions (does the model assume that groups are equal?).

- You should provide a summary of your analysis; a good way to do so is to have a table that has the SSE of each model that you tested and the BIC associated with that model.

--

- Discussion:

  - One paragraph on: What conclusions can you draw from the model comparisons about the experimental question?

--

  - One paragraph on: What might be the broader implications of this finding?

--

  - One paragraph on: What are the limitations of this study and/or what are the gaps left by this study? Be sure to explain why the limitation could impact the results.

---
class: inverse, center, middle

# Review of BICs

---
## BIC

- Last week we mentioned that, regardless of what data we have, a model that assumes there is more than one distribution will always be better, in terms of the error it produces, than a simpler model.

--

- We can try to understand this intuitively by looking at a graph that uses a single mean as a prediction. To do this we will use our smokers data:

--

<img src="data:image/png;base64,#lec-9_files/figure-html/unnamed-chunk-1-1.png" style="display: block; margin: auto;" />

---
## BIC

- If we use two predictions instead of one (as in the effects model) the error will be reduced.

--

<img src="data:image/png;base64,#lec-9_files/figure-html/unnamed-chunk-2-1.png" style="display: block; margin: auto;" />

---
## BIC

- In general, we know that the more parameters (lines in the previous example) our models have, the more they will resemble the values in our data (less error).
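- As a quick check of this claim, here is a minimal sketch with simulated data (the smokers data from the graphs is not reproduced here): predicting with one mean per group can never increase the SSE relative to a single shared mean.

```r
set.seed(1)
y <- rnorm(30, mean = 10, sd = 2)       # 30 simulated observations
g <- rep(c("a", "b", "c"), each = 10)   # an arbitrary 3-group split

# SSE with a single shared mean (1 parameter)
sse_one <- sum((y - mean(y))^2)

# SSE with one mean per group (3 parameters); ave() maps each group's
# mean back onto its own observations
sse_three <- sum((y - ave(y, g))^2)

sse_one >= sse_three   # always TRUE: more means never increase the SSE
```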
--

- This is known as "fit", and for linear models adding more parameters or unknowns (like adding more values of `\(\mu\)`) to our models will always improve their fit (how close our predictions are to our observations).

--

- If we only use "fit" or error as the criterion for which theory is better, then we could always add more and more lines to the graph until we get a single line for each point.

--

- However, this would not be useful, as it tells us that in order to make a prediction we need to know the value that we want to predict. This doesn't make sense.

---
## BIC

- To avoid this problem we would like to take into account two things:

--

1. How many parameters (unknown values) we use to make predictions (model complexity).

--

2. How close those predictions are to the observed values (model fit).

--

- BIC is one of the methods (there are more) that can accomplish this objective, and it does so by assigning a value to how bad the predictions of the model are (how large the total error is) and how many parameters (unknown values) the model uses to make its predictions.

--

- The BIC will assign higher values to models that make larger errors or that need more parameters (unknown values) to make a prediction.

--

- If we select the model with the lowest BIC from the ones that we are using, then we are making a decision that weighs both things: the complexity of the model and the accuracy of the model (fit).

---
class: inverse, center, middle

# Multiple Groups

---
## Comparison between multiple groups

- Last class we started looking at an example where we had the level of anxiety of 3 cohorts of students at a university: the 2018, 2019, and 2020 cohorts.

--

- We formalized a model that assumes that there are no differences in the levels of anxiety of First year students across the 3 cohorts (Null model), which states:

`$$y_{ij}\sim\text{Normal}(\mu,\sigma_0^2)$$`

for `\(i=1, \dots,30\)` students in each of the `\(j=1,2,3\)` cohorts (2018 to 2020).
--

- The second model assumes that the levels of anxiety of First year students in one of the three cohorts are different from the rest. Formally, this model is expressed as:

`$$y_{ij}\sim\text{Normal}(\mu_j,\sigma_e^2)$$`

--

- We also mentioned that this model does not address our original question, as the conclusion that we can draw from it is only that at least one of the groups is different from the rest.

---
## Results: review

- The estimations for the Null model are:

.pull-left[

```r
n_total <- nrow(anxiety)

anxiety <- anxiety %>%
  mutate("null_pred" = mean(anxiety),
         "null_error" = (anxiety - null_pred)^2)

sse_0 <- sum(anxiety$null_error)
mse_0 <- 1/n_total * sse_0
```
]

.pull-right[
Prediction: 10.08

SSE: 598.46

Mean SE: 6.65
]

---
## Results: review

- The estimations from the Effects model are:

.pull-left[

```r
mean_groups <- anxiety %>%
  group_by(cohort) %>%
  summarise("pred" = mean(anxiety))

anxiety <- anxiety %>%
  mutate("eff_pred" = case_when(cohort == "2018" ~ mean_groups$pred[1],
                                cohort == "2019" ~ mean_groups$pred[2],
                                cohort == "2020" ~ mean_groups$pred[3]),
         "eff_error" = (anxiety - eff_pred)^2)

sse_e <- sum(anxiety$eff_error)
mse_e <- 1/n_total * sse_e
```
]

.pull-right[
Prediction:

- 2018: 9.17
- 2019: 8.93
- 2020: 12.13

SSE: 407.5

Mean SE: 4.53
]

---
## Results: review

- The conclusions that we drew from those estimations were that:

--

1. The effects model accounted for 31.9% of the total variation in anxiety scores of the students in the 3 different cohorts.

--

2. According to the BIC values, we should select the Effects model (149.42) over the Null (175.01). This means that at least one of the 3 groups is different from the others.

--

- However, our original problem stated that we wanted to know if there were any differences between the cohorts, not just that there was at least one cohort that was different.

--

- To solve this problem we have to look at three additional models.
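---
## Results: review

- The BIC values above can be reproduced from the SSEs with a short sketch. (An assumption on our part: the slides report only the final values; the formula below, `\(\text{BIC} = n\log(\text{SSE}/n) + k\log(n)\)` with `\(k\)` the number of estimated means, is one common form and it matches the reported numbers.)

```r
n <- 90   # 30 students in each of the 3 cohorts

# Hypothetical helper, not from the course code
bic <- function(sse, k, n) n * log(sse / n) + k * log(n)

bic_null    <- bic(sse = 598.46, k = 1, n = n)   # Null model: ~175.01
bic_effects <- bic(sse = 407.50, k = 3, n = n)   # Effects model: ~149.42
```

- Lower is better, so the Effects model is preferred, in agreement with the conclusion above.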
---
## Group models

- From the results of the comparison between the Null and Effects model we know that there is at least one cohort of students that has a different anxiety level in comparison to the others.

--

- In order to find out which group is different, we can build three new models that assume that a single cohort is different from the rest.

--

- Notice that making the comparison between the Null and Effects model at the start will save us a lot of work if it turns out that the best model is the Null.

---
## Three models

- We need 3 models that formalize 3 assumptions:

--

1. Students in the 2018 cohort are different from students in the 2019 and 2020 cohorts.

--

2. Students in the 2019 cohort are different from students in the 2018 and 2020 cohorts.

--

3. Students in the 2020 cohort are different from students in the 2018 and 2019 cohorts.

--

- To make this easy to remember, we will refer to each model using the year associated with the cohort that is **different** from the rest (e.g., the "2018 model" is the one that assumes that students in the 2018 cohort are different from students in the other 2).

---
## 2018 model

- We can formalize our "2018 model" using the following notation:

`$$y_{i1} \sim \text{Normal}(\mu_1,\sigma_1^2)$$`

`$$y_{i2},y_{i3} \sim \text{Normal}(\mu,\sigma_1^2)$$`

--

- Notice that only the expected value of the anxiety level of students in the 2018 cohort `\((\mu_1)\)` is different from the other two `\((\mu)\)`.

--

- Our estimators for `\(\mu_1\)`, `\(\mu\)` and `\(\sigma_1^2\)` will be similar to the ones we had before.

--

- First, the estimator `\(\hat{\mu}_1\)` will be the average anxiety level in the 2018 cohort, while the estimator for the other 2 cohorts `\(\hat{\mu}\)` will be the average anxiety level of all students in the 2019 and 2020 cohorts.

---
## Adding the predictions of the 2018 model

- An easy way to calculate and add the predictions of this new model to our data is by first creating an "indicator" variable.
This is a variable that takes one value if the observation belongs to the 2018 cohort and another value if it belongs to one of the others.

```r
anxiety <- anxiety %>%
  mutate("id_2018" = ifelse(test = cohort == "2018",
                            yes = "2018",
                            no = "other"))
```

--

- Then we can use that variable to group our data and get the averages that we need.

--

```r
pred_2018 <- anxiety %>%
  group_by(id_2018) %>%
  summarise("pred" = mean(anxiety))
```

--

- The predicted anxiety level for First year students in the 2018 cohort was 9.17, and for students in the 2019 and 2020 cohorts it was 10.53.

- Using those two estimators as the predictions of the anxiety levels, we can calculate the error of the model. Again, we start by calculating the error for each observation by taking the observation, subtracting the prediction ( `\(\hat{\mu}_1\)` and `\(\hat{\mu}\)`, respectively), and then squaring that difference.
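- In code, that remaining step follows the same pattern as the Effects model. The sketch below is runnable on its own: the simulated data frame is a stand-in (the course's anxiety data set is not reproduced here), and the column names `m2018_pred` and `m2018_error` are our own.

```r
library(dplyr)

# Stand-in data, since the course's anxiety data set is not reproduced
# here; only the computation pattern matters.
set.seed(10)
anxiety <- tibble(cohort  = rep(c("2018", "2019", "2020"), each = 30),
                  anxiety = rnorm(90, mean = 10, sd = 2.5))
n_total <- nrow(anxiety)

anxiety <- anxiety %>%
  mutate("id_2018" = ifelse(cohort == "2018", "2018", "other"))

pred_2018 <- anxiety %>%
  group_by(id_2018) %>%
  summarise("pred" = mean(anxiety))

# The new step: attach each row's prediction and square the differences.
# pred_2018 is sorted by id_2018, so row 1 is "2018" and row 2 is "other".
anxiety <- anxiety %>%
  mutate("m2018_pred"  = ifelse(id_2018 == "2018",
                                pred_2018$pred[1],
                                pred_2018$pred[2]),
         "m2018_error" = (anxiety - m2018_pred)^2)

sse_2018 <- sum(anxiety$m2018_error)
mse_2018 <- 1 / n_total * sse_2018
```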